Abstract. The clustering of objects has become a theme in many studies,
and many researchers use similarity to cluster instances automatically.
However, few studies consider using Kolmogorov complexity to obtain
information about objects from documents such as Web pages, where rich
information has proved difficult to extract. In this paper, we propose a
similarity measure derived from Kolmogorov complexity, and we demonstrate
the possibility of exploiting features from the Web based on hit counts
for objects of Indonesian intellectuals.

Keywords: Kolmogorov complexity, distance, similarity, singleton, doubleton.

1 Introduction

In mathematics, an object is an abstract entity arising in mathematics,
generally known as a mathematical object. Commonly these include numbers,
permutations, partitions, matrices, sets, functions, and relations. In
computer science, these objects can be viewed as binary strings, or as
strings in the form of words, sentences or documents. Thus we will refer
to objects and strings interchangeably in this paper. Likewise, some
research refers to data as objects or to objects as data.

The complexity of a binary string is the length of the shortest program
which can output the string on a universal Turing machine and then stop
[1]. A universal Turing machine is an idealized computing device capable
of reading, writing, processing instructions and halting [2,3]. The
concept of the Turing machine is widely used in theoretical computer
science as a mathematical model of computation for approaching some
real-world problems. One such problem is word sense, mainly with respect
to context. This problem appears in applications such as machine
translation and text summarization, where most existing systems need to
understand the correct meaning (semantic relation) and function of words
in natural language. This means that the acquisition of knowledge needs a
model to abstract incomplete information. Therefore, this paper addresses
a measurement tool based on Kolmogorov complexity for finding relations
among objects. In Section 2, we first review the basic terminology and
concepts. In Section 3, we state the fundamental results and discuss the
properties of similarity in a lemma and a theorem. In Section 4, we study
a set of objects from Indonesian intellectuals.

2 Related Work 

In mathematics, it is more important that objects be definable in some
uniform way, for example as sets, regardless of actual practice, in order
to lay bare the essence of its paradoxes. The management of paradox has
traditionally been accorded higher priority than the faithful reflection
of the details of mathematical practice as a justification for defining
objects. Turing showed, in his famous work on the halting problem, that it
is impossible to write a computer program which is able to predict whether
some other program will halt [4,5]. Thus it is impossible to compute the
complexity of a binary string exactly. However, methods have been
developed to approximate it. Kolmogorov complexity is the length of the
shortest program which can output the string, where objects can be given
literally, as a human can be represented in DNA [5].

Kolmogorov complexity, also known as algorithmic entropy, stochastic
complexity, descriptive complexity, Kolmogorov-Chaitin complexity and
program-size complexity, is used to describe the complexity or degree of
randomness of a binary string. It was independently developed by Andrey N.
Kolmogorov, Ray Solomonoff and Gregory Chaitin in the 1960s [7,5]. For an
introduction and details see the textbook [5].

Definition 1. The Kolmogorov complexity of a string $x$, denoted as
$K(x)$, is the length, in bits, of the shortest computer program of the
fixed reference computing system that produces $x$ as output.

The choice of computing system changes the value of $K(x)$ by at most an
additive fixed constant. Since $K(x) \to \infty$, this additive fixed
constant is an ignorable quantity if $x$ is large. One way to think about
the Kolmogorov complexity $K(x)$ is to view it as the length (in bits) of
the ultimate compressed version from which $x$ can be recovered by a
general decompression program. The associated decompression algorithm
transforms the compressed version $x_z$ back into $x$ or a string very
close to $x$. A lossless compression algorithm is one in which the
decompression algorithm exactly computes $x$ from $x_z$, and a lossy
compression algorithm is one in which $x$ can only be approximated from
$x_z$. Usually, the length $|x_z| < |x|$. Using a better compressor
results in $x_b$ with no redundant information, usually $|x_b| < |x_z|$,
etc. So, lossless compression algorithms are used when there can be no
loss of data between compression and decompression. When $K(x)$ is
approximated in this way, the approximation corresponds to an upper bound
of $K(x)$ [5]. Let $C$ be any compression algorithm and let $C(x)$ be the
result of compressing $x$ using $C$.

Definition 2. The approximate Kolmogorov complexity of $x$, using $C$ as a
compression algorithm, denoted $K_C(x)$, is

$$K_C(x) = \frac{\mathrm{Length}(C(x))}{\mathrm{Length}(x)} + q = \frac{|C(x)|}{|x|} + q$$

where $q$ is the length in bits of the program which implements $C$.

If $C$ was able to compress $x$ a great deal, then $K_C(x)$ is low and
thus $x$ has low complexity. Using this approximation, the similarity
between two finite objects can be compared [10,9].
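To make Definition 2 concrete, the ratio $|C(x)|/|x|$ can be estimated with an off-the-shelf compressor. The following is a minimal sketch, assuming Python's zlib plays the role of the compressor $C$ and that the constant $q$ is ignored; the function name `approx_complexity` is hypothetical and does not come from the paper.

```python
import os
import zlib


def approx_complexity(data: bytes) -> float:
    """Return |C(x)| / |x| using zlib as the compressor C (q is ignored)."""
    if not data:
        raise ValueError("data must be non-empty")
    return len(zlib.compress(data, 9)) / len(data)


regular = b"0100" * 1000          # highly repetitive, so it compresses well
random_like = os.urandom(4000)    # random bytes are close to incompressible

print(approx_complexity(regular))
print(approx_complexity(random_like))
```

A regular string should yield a ratio close to 0, while random data should yield a ratio near 1 (or slightly above it, because of the compressor's fixed overhead).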

Definition 3. The information shared between two strings $x$ and $y$,
denoted $I(x : y)$, is $I(x : y) = K(y) - K(y|x)$, where $K(y|x)$, the
Kolmogorov complexity of $y$ relative to $x$, is the length of the
shortest program which can output $y$ if $K(x)$ is given as additional
input to the program.



Table 1. Data compression

Previous classification research using Kolmogorov complexity has been
based on the similarity metric developed in [11,12]. Two strings which are
similar share patterns and can be compressed more when concatenated than
separately. In this way the similarities between data can be measured.
This method has been successfully used for classifying documents, music,
email and network traffic, detecting plagiarism, computing similarities
between genomes and tracking the evolution of chain letters
[13,14,15,16,17,18].

3 Distance, Metric and Similarity 

Suppose there is a pattern matching algorithm based on compressing each
consecutive set of four binary digits (hexadecimal). Let $C$ be the
program that performs this compression. For each string $w$, $C$ generates
a key of single characters which correspond to sets of four digits. The
string $s_1 = b_0b_1b_1b_0b_1b_1b_1b_0$ will generate keys
$k_1 = b_0b_1b_1b_0$ and $k_2 = b_1b_1b_1b_0$. The compressed string is
composed of the representation plus the key, i.e. $k_1k_2$ +
"$k_1 = b_0b_1b_1b_0$, $k_2 = b_1b_1b_1b_0$". Suppose a second string is
$s_2 = b_0b_1b_1b_0b_1b_1b_0b_0$ with keys $k_1 = b_0b_1b_1b_0$ and
$k_3 = b_1b_1b_0b_0$; then the compressed string of $s_2$ is $k_1k_3$ +
"$k_1 = b_0b_1b_1b_0$, $k_3 = b_1b_1b_0b_0$". Because the key $k_1$ is
already available from $s_2$, we can write $C(s_1|s_2) = k_1k_2$ +
"$k_2 = b_1b_1b_1b_0$". Thus $|C(s_1|s_2)| < |C(s_1)|$ because there is a
similar pattern in $s_1$ and $s_2$. For example, we have three strings

$s_1$ = 0100 1101 0100 0001 0100 1000 0101 1010 0101 0101,

$s_2$ = 0100 0100 0100 0100 0100 1001 0100 1110, and

$s_3$ = 1001 1010 1001 1001 0100 0100 0100 1001.

We can compress each string individually and also compress $s_1$ using the
keys already developed for $s_2$ and $s_3$; see Table 1.
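The key-based view of the three example strings can be explored with a few lines of code. This is a minimal sketch that only splits each string into 4-bit blocks and counts them; it does not reproduce the compressed lengths of Table 1, and the helper name `block_counts` is hypothetical.

```python
from collections import Counter

STRINGS = {
    "s1": "0100 1101 0100 0001 0100 1000 0101 1010 0101 0101",
    "s2": "0100 0100 0100 0100 0100 1001 0100 1110",
    "s3": "1001 1010 1001 1001 0100 0100 0100 1001",
}


def block_counts(s: str) -> Counter:
    """Count how often each 4-bit block occurs in a space-separated string."""
    return Counter(s.split())


counts = {name: block_counts(s) for name, s in STRINGS.items()}
# Union of all distinct blocks, i.e. the keys a shared dictionary would need.
all_blocks = set().union(*(c.keys() for c in counts.values()))

for name, c in counts.items():
    print(name, dict(c))
print("distinct blocks:", len(all_blocks))
```

Blocks that recur across strings (here 0100 occurs in all three) are the ones whose keys can be reused when one string is compressed with the keys of another.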

$$I_C(s_2 : s_1) = K_p(s_1) - K_p(s_1|s_2) = 0.85 - 0.75 = 0.10$$
$$I_C(s_3 : s_1) = K_p(s_1) - K_p(s_1|s_3) = 0.85 - 0.55 = 0.30$$

Thus $I_C(s_3 : s_1) > I_C(s_2 : s_1)$, which means that $s_1$ and $s_3$
share more information than $s_1$ and $s_2$. This shows that the
information shared between two strings can be approximated by using a
compression algorithm $C$. Accordingly, the length of the shortest binary
program in the reference universal computing system such that the program
computes output $y$ from input $x$, and also output $x$ from input $y$, is
called the information distance [19,11,12].

Definition 4. Let $X$ be a set. A function $E : X \times X \to \mathbb{R}$
is called information distance (or dissimilarity) on $X$, denoted
$E(x,y)$, i.e. $E(x,y) = K(x|y) - \min\{K(x), K(y)\}$, if for all
$x, y, z \in X$ it holds:

1. $E(x,y) \geq 0$ (non-negativity);

2. $E(x,y) = E(y,x)$ (symmetry); and

3. $E(x,y) \leq E(x,z) + E(z,y)$ (triangle inequality).

This distance $E(x,y)$ is actually a metric. Moreover, consider the large
class of admissible distances: distances that are nonnegative, symmetric,
and computable in the sense that for every such distance $D$ there is a
prefix program that has binary length equal to the distance $D(x,y)$
between $x$ and $y$. This means that

$$E(x,y) \leq D(x,y) + c_D$$

where $c_D$ is a constant that depends only on $D$ but not on $x$ and $y$.
However, these distances are absolute, so they are not suitable for
relating objects of different sizes to one another. Thus we need to
normalize the information distance.

Definition 5. Normalized information distance, denoted $N(x|y)$, is

$$N(x|y) = \frac{K(x|y) - \min\{K(x), K(y)\}}{\max\{K(x), K(y)\}}$$

such that $N(x|y) \in [0,1]$.

Analogously, if $C$ is a compressor and we use $C(x)$ to denote the length
of the compressed version of a string $x$, we define the normalized
compression distance.

Definition 6. Normalized compression distance, denoted $N_C(x|y)$, is

$$N_C(x|y) = \frac{C(xy) - \min\{C(x), C(y)\}}{\max\{C(x), C(y)\}}$$

where for convenience the pair $(x|y)$ is replaced by the concatenation
$xy$.

From Table 1 we calculate $N_C(s_1|s_2) = 0.294118$, whereas
$N_C(s_1|s_3) = -0.058824$.
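Definition 6 can be tried out directly with a general-purpose compressor. The sketch below assumes zlib as the compressor $C$ and measures lengths in bytes; it is not the pattern-matching compressor of Table 1, so it will not reproduce the two values above, and the helper names `clen` and `ncd` are hypothetical.

```python
import zlib


def clen(data: bytes) -> int:
    """Length in bytes of the zlib-compressed data, standing in for C(x)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance following Definition 6."""
    cx, cy, cxy = clen(x), clen(y), clen(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


a = b"the quick brown fox jumps over the lazy dog " * 50
b = b"the quick brown fox jumps over the lazy cat " * 50
c = bytes(range(256)) * 10

print(ncd(a, b))   # similar texts: expected to be comparatively small
print(ncd(a, c))   # unrelated data: expected to be larger
```

With a real compressor the result is only approximately confined to [0, 1], since the compressed length of a concatenation carries some overhead.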




A string can give a name to an object, like "the three-letter genome of
'love'" or "the text of The Da Vinci Code by Dan Brown". There are also
objects that do not have a literal name but acquire their meaning from
their contexts in the background common knowledge of humankind, like "car"
or "green". The objects are classified by words; the words, as objects,
are classified in sentences, which represent how society uses the objects;
and the words and the sentences are classified in documents.

Definition 7. $W = \{w_1, \ldots, w_v\}$ represents the set of unique
words (i.e., the vocabulary), where each word is a grain of the vocabulary
indexed by $\{1, \ldots, v\}$.

Definition 8. A document $d$ is a sequence of $n$ words denoted by
$\mathbf{w} = \{w_i \mid i = 1, \ldots, n\}$, where $w_n$ denotes the
$n$th word in a document.

Definition 9. A corpus is a collection of $m$ documents denoted by
$D = \{d_j \mid j = 1, \ldots, m\}$, where $d_m$ denotes the $m$th
document in a corpus.

In the real world, corpora are of two kinds: annotated corpora and large
corpora. The last definition is a representation of a body of information
physically limited by the designed capacity for managing documents.
Unfortunately, modelling a collection of documents as an annotated corpus
not only needs more time and cost to construct and manage, but also
eliminates its dynamic property. On the other hand, the collection of
digital documents on the Internet, the Web, has grown enormously and
changes continuously, and access to it is generally based on indexes.

6 M. K. M. Nasution

Let the set of documents indexed by the system tool be $\Omega$, where its
cardinality is $|\Omega|$. In our example, $\Omega = \{k_1, \ldots, k_8\}$
and $|\Omega| = 13$. Let every term $x$ define a singleton event
$x \subseteq \Omega$ of documents that contain an occurrence of $x$. Let
$P : \Omega \to [0,1]$ be the uniform mass probability function. The
probability of event $x$ is $P(x) = |x|/|\Omega|$. Similarly, for terms
$x$ AND $y$, the doubleton event $x \cap y \subseteq \Omega$ is the set of
documents that contain both term $x$ and term $y$ (co-occurrence), where
their joint probability is $P(x \cap y) = |x \cap y|/|\Omega|$. Then,
based on other Boolean operations and rules, the probabilities of further
events can be developed via the above singletons or doubletons. From
Table 1, we know that term $k_1$ has $|k_1| = 3$ in $s_1$, $|k_1| = 6$ in
$s_2$ and $|k_1| = 3$ in $s_3$. The probability of event $k_1$ is
$P(k_1) = 3/13 \approx 0.230769$ because term $k_1$ occurs in three
strings as documents. The probability of event $\{k_1, k_5\}$ is
$P(\{k_1, k_5\}) = 2/13 = 0.153846$, from $s_1$ and $s_3$.
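The singleton and doubleton events can be illustrated on a small index. The sketch below uses a hypothetical toy collection of four documents (the document ids and terms are invented for illustration and are not the paper's data) and computes $P(x) = |x|/|\Omega|$ and $P(x \cap y) = |x \cap y|/|\Omega|$ exactly with fractions.

```python
from fractions import Fraction

# Hypothetical toy index: document id -> set of terms it contains.
INDEX = {
    "d1": {"horse", "rider", "saddle"},
    "d2": {"horse", "stable"},
    "d3": {"rider", "bicycle"},
    "d4": {"horse", "rider"},
}


def event(term: str) -> set:
    """Singleton event: the set of documents containing the term."""
    return {doc for doc, terms in INDEX.items() if term in terms}


def prob(docs: set) -> Fraction:
    """Uniform probability of an event, |event| / |Omega|."""
    return Fraction(len(docs), len(INDEX))


x, y = event("horse"), event("rider")
print(prob(x))        # P(x)
print(prob(x & y))    # P(x AND y), the doubleton event
```

A search engine exposes only the sizes of these events (hit counts), which is why the later formulas are written in terms of $|x|$, $|y|$ and $|x \cap y|$.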

It is known that, just as the complexity $C(x)$ of a string $x$ represents
the length of the compressed version of $x$ using compressor $C$, for a
search term $x$ the search engine code of length $S(x)$ represents the
shortest expected prefix-code word length of the associated search engine
event $x$. Therefore, we can rewrite the equation in Definition 6 as

$$N_S(x,y) = \frac{S(x|y) - \min\{S(x), S(y)\}}{\max\{S(x), S(y)\}},$$

called the normalized search engine distance.

Let there be a probability mass function over the set
$\{\{x,y\} : x, y \in S\}$ of search terms used by the search engine,
based on probability events, where $S$ is the universe of singleton terms.
There are $|S|$ singleton terms and $\binom{|S|}{2}$ doubletons, each
consisting of a pair of non-identical terms, $x \neq y$,
$\{x,y\} \subseteq S$. Let $z \in x \cap y$; if $x = x \cap x$ and
$y = y \cap y$, then $z \in x \cap x$ and $z \in y \cap y$. For
$|\Psi| = \sum_{\{x,y\} \subseteq S} |x \cap y|$, it means that
$|\Psi| \geq |\Omega|$, or $|\Psi| \leq \alpha|\Omega|$, where $\alpha$ is
a constant of search terms. Consequently, we can define
$p(x) = P(x)|\Omega|/|\Psi| = |x|/|\Psi|$, and for $x = x \cap x$ we have

$$p(x) = \frac{P(x)|\Omega|}{|\Psi|} = \frac{P(x \cap x)|\Omega|}{|\Psi|} = p(x,x), \quad \text{or} \quad p(x,x) = \frac{|x \cap x|}{|\Psi|}.$$

Since $P(x|y)$ denotes a conditional probability, $p(x) = p(x|x)$ and
$p(x|y) = P(x \cap y)|\Omega|/|\Psi|$. Let $\{k_1, k_5\}$ be a set; there
are three subsets that contain $k_1$ or $k_5$: $\{k_1\}$, $\{k_5\}$, and
$\{k_1, k_5\}$. Let us define an analogy in which $S(x)$ and $S(x|y)$ mean
$p(x)$ and $p(x|y)$. Based on the normalized search engine distance
equation, we have

$$N_S(x,y) = \frac{|x \cap y|/|\Psi| - \min(|x|/|\Psi|, |y|/|\Psi|)}{\max(|x|/|\Psi|, |y|/|\Psi|)} = \frac{|x \cap y| - \min(|x|, |y|)}{\max(|x|, |y|)} \qquad (1)$$

Definition 10. Let $X$ be a set. A function
$s : X \times X \to \mathbb{R}$ is called similarity (or proximity) on $X$
if $s$ is non-negative, symmetric, and if $s(x,y) \leq s(x,x)$,
$\forall x, y \in X$, with equality if and only if $x = y$.



Kolmogorov Complexity: Clustering Objects and Similarity 7 



Lemma 1. If, for $x, y \in X$, $s(x,y) = 0$ is the minimum (weakest) value
between $x$ and $y$ and $s(x,y) = 1$ is the maximum (strongest) value
between $x$ and $y$, then $s$ is a function $s : X \times X \to [0,1]$
such that $\forall x, y \in X$, $s(x,y) \in [0,1]$.

Proof. Let $|X|$ be the cardinality of $X$, and $|x|$ the number of
occurrences of $x$ in $X$. The ratio between $x$ and $X$ satisfies
$0 \leq |x|/|X| \leq 1$, where $|x| \leq |X|$.

The value $s(x,x)$ means that the number of $x$ is compared with $x$
itself, i.e. $|x|/|x| = 1$, or $\forall x \in X$, $|X|/|X| = 1$. Thus
$1 \in [0,1]$ is the closest value of $s(x,x)$, called the maximum
(strongest) value.

In other words, let $z \notin X$; $|z| = 0$ means that $z$ does not occur
in $X$, and the ratio between $z$ and $X$ is 0, i.e., $|z|/|X| = 0$. Thus
$0 \in [0,1]$ is the least close value of $s(x,z)$, called the minimum
(weakest) value.

The value $s(x,y)$ means a ratio between the number of occurrences of $x$
in $X$ and the number of occurrences of $y$ in $X$, i.e., $|x|/|X|$ and
$|y|/|X|$, $x, y \in X$. If $|X| = |x| + |y|$, then $|x| \leq |X|$ and
$|y| \leq |X|$, or $(|x|/|X|)(|y|/|X|) = |x||y|/|X|^2 \leq 1$ and
$|x||y|/|X|^2 \geq 0$. Thus $s(x,y) \in [0,1]$, $\forall x, y \in X$.

Theorem 1. $\forall x, y \in X$, the similarity of $x$ and $y$ in $X$ is

$$s(x,y) = \frac{2|x \cap y|}{|x| + |y|} + c$$

where $c$ is a constant.

Proof. By Definition 4 and Definition 10, the main transform used to
obtain a distance (dissimilarity) $d$ from a similarity $s$ is
$d = 1 - s$, and from (1) we obtain

$$1 - s = \frac{|x \cap y| - \min(|x|, |y|)}{\max(|x|, |y|)}.$$

Based on Lemma 1, for the maximum value of $s$, which is 1, we have

$$0 = \frac{|x \cap y| - \min(|x|, |y|)}{\max(|x|, |y|)}$$

or $|x \cap y| = \min(|x|, |y|)$. For the minimum value of $s$, which is
0, we obtain

$$1 = \frac{|x \cap y| - \min(|x|, |y|)}{\max(|x|, |y|)}$$

or

$$|x \cap y| = \max(|x|, |y|) + \min(|x|, |y|) = |x| + |y|$$

or $1 = |x \cap y|/(|x| + |y|)$. We know that $|x| + |y| > |x \cap y|$,
because their ratios are not 1. If $x = y$, then
$|x \cap y| = |x| = |y|$, and consequently
$1 = 2|x \cap y|/(|x| + |y|)$. Therefore, we have
$s = \frac{2|x \cap y|}{|x| + |y|} + 1$ and $c = 1$, or

$$s = \frac{2|x \cap y|}{|x| + |y|} + c.$$
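The expression in Theorem 1 can be evaluated on explicit document sets. The following is a minimal sketch in which the constant $c$ is left as a parameter that defaults to 0; that default is an assumption made here so that the result stays within the range given by Lemma 1. The function name and the document ids are hypothetical.

```python
def theorem_similarity(x: set, y: set, c: float = 0.0) -> float:
    """Return 2|x ∩ y| / (|x| + |y|) + c for two finite sets."""
    if not x and not y:
        raise ValueError("at least one set must be non-empty")
    return 2 * len(x & y) / (len(x) + len(y)) + c


docs_x = {"d1", "d2", "d4"}   # hypothetical documents containing term x
docs_y = {"d1", "d3", "d4"}   # hypothetical documents containing term y

print(theorem_similarity(docs_x, docs_y))
print(theorem_similarity(docs_x, docs_x))
```

With $c = 0$ the value is 1 for identical sets and 0 for disjoint sets, matching the maximum and minimum values discussed in Lemma 1.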



For normalization, we define $|x| = \log f(x)$ and
$2|x \cap y| = \log(2 f(x,y))$, and the similarity in Definition 11
satisfies Theorem 1.






Definition 11. Let the similarity metric $M$ be a function
$s(x,y) : X \times X \to [0,1]$, $x, y \in X$. We define the similarity
metric $M$ as follows:

$$s(x,y) = \frac{\log(2 f(x,y))}{\log(f(x) + f(y))}$$

In [12], the authors developed the Google similarity distance for Google
search engine results, based on Kolmogorov complexity:

$$NGD(x,y) = \frac{\max\{\log f(x), \log f(y)\} - \log f(x,y)}{\log N - \min\{\log f(x), \log f(y)\}}$$

For example, at the time, a Google search for "horse" returned 46,700,000
hits, a search for "rider" returned 12,200,000 hits, and a search for the
pages where both "horse" and "rider" occur gave 2,630,000. Google indexed
$N = 8,058,044,651$ web pages, and $NGD(horse, rider) \approx 0.443$.
Using the equation in Definition 11, we have $s(x,y) \approx 0.865$, about
two times the result of the Google similarity distance. At the time of
doing the experiment, we obtained 150,000,000 and 57,000,000 hits for
"horse" and "rider" from Google, respectively. The number of hits for the
search for both terms, "horse" AND "rider", is 12,400,000, but we do not
have $N$ exactly and could only predict it. We use the similarity metric
$M$ to compare the results returned by Google and Yahoo!; see Table 2.



Table 2. Similarity for two results.

Search engine   x ("horse")    y ("rider")    x AND y       s(x,y)
Google          150,000,000    57,000,000     12,400,000    0.889187
Yahoo!          737,000,000    256,000,000    52,000,000    0.891084
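The figures quoted in the text and in Table 2 can be recomputed from the hit counts. The sketch below is a minimal illustration, assuming natural logarithms (the base cancels in both ratios); the function names `ngd` and `similarity_m` are hypothetical.

```python
import math


def ngd(fx: float, fy: float, fxy: float, n: float) -> float:
    """Normalized Google distance from hit counts and index size n."""
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))


def similarity_m(fx: float, fy: float, fxy: float) -> float:
    """Similarity metric M: log(2 f(x,y)) / log(f(x) + f(y))."""
    return math.log(2 * fxy) / math.log(fx + fy)


# Hit counts for "horse", "rider" and "horse AND rider" quoted in the text.
print(round(ngd(46_700_000, 12_200_000, 2_630_000, 8_058_044_651), 3))
print(round(similarity_m(46_700_000, 12_200_000, 2_630_000), 3))
print(round(similarity_m(150_000_000, 57_000_000, 12_400_000), 6))   # Google
print(round(similarity_m(737_000_000, 256_000_000, 52_000_000), 6))  # Yahoo!
```

Unlike the NGD, the metric $M$ needs no estimate of the index size $N$, which is convenient when the search engine does not publish it.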



4 Application and Experiment 

We are given a set of objects as points, in this case a set of authors
among the Indonesian intellectuals of Commissie voor de Volkslectuur with
their works (Table 3), and a set of authors among the Indonesian
intellectuals of New Writer with their works (Table 4).

The authors of Commissie voor de Volkslectuur are the following 9 people:

{(1) Merari Siregar; (2) Marah Roesli; (3) Muhammad Yamin; (4) Nur Sutan
Iskandar; (5) Tulis Sutan Sati; (6) Djamaluddin Adinegoro; (7) Abas Soetan
Pamoentjak; (8) Abdul Muis; (9) Aman Datuk Madjoindo}.

The authors of New Writer are 12 people, i.e.,

Table 3. Indonesian Intellectual of Commissie voor de Volkslectuur

id   Name of Indonesian Intellectual           Year  Author  Value   Type
a.                                             1920  1       0.7348  7
b.                                             1931  1       0.6569
c.   Cinta dan Hawa Nafsu                                    0.4357
d.                                             1922  2       0.5706  6
e.   La Hami                                   1924  2       0.3831  4
f.                                             1956  2       0.5461  5
g.   Tanah Air                                 1922  3       0.6758  7
h.   Indonesia, Tumpah Darahku                 1928  3       0.5183  5
i.   Kalau Dewi Tara Sudah Berkata                   3       0.4582  5
j.   Ken Arok dan Ken Dedes                    1934  3       0.4922  5
k.   Apa Dayaku karena Aku Seorang Perempuan   1923  4       0.5374  5
l.   Cinta yang Membawa Maut                   1926  4       0.8189  8
m.   Salah Pilih                               1928  4       0.7476  7
n.   Karena Mentua                             1932  4       0.6110  6
o.   Tuba Dibalas dengan Susu                  1933  4       0.5918  6
p.   Hulubalang Raja                           1934  4       0.7759  7
q.   Katak Hendak Menjadi Lembu                1935  4       0.8424  8
r.   Tak Disangka                              1923  5       0.4811  5
s.   Sengsara Membawa Nikmat                   1928  5       0.6006  6
t.   Tak Membalas Guna                         1932  5       0.5139  5
u.   Memutuskan Pertalian                      1932  5       0.6150  6
v.   Darah Muda                                1927  6       0.3632  4
w.   Asmara Jaya                               1928  6       0.3896  4
x.   Pertemuan                                 1927  7       0.2805  2
y.   Salah Asuhan                              1928  8       0.7425  7
z.   Pertemuan Djodoh                          1933  8       0.4376  4
aa.  Menebus Dosa                              1932  9       0.4531  5
ab.  Si Cebol Rindukan Bulan                   1934  9       0.7516  7
ac.  Sampaikan Salamku Kepadanya               1935  9       0.5786  6


Table 4. Indonesian Intellectual of New Writer

id   Name of Indonesian Intellectual           Year  Author  Value   Type
A.   Dian Tak Kunjung Padam                    1932          0.6372  6
B.   Tebaran Mega (kumpulan sajak)             1935          0.6189  6
C.   Layar Terkembang                          1936          0.7494  7
D.   Anak Perawan di Sarang Penyamun           1940          0.6095  6
E.   Di Bawah Lindungan Ka'bah                 1938          0.4302  4
F.   Tenggelamnya Kapal van der Wijck          1939          0.7245  7
G.   Tuan Direktur                             1950          0.6506  6
H.   Didalam Lembah Kehidoepan                 1940          0.3723  4
I.   Belenggu                                  1940          0.6007  6
J.   Jiwa Berjiwa                                            0.4669  5
K.   Gamelan Djiwa (kumpulan sajak)            1960          0.6055  6
L.   Djinak-djinak Merpati (sandiwara)         1950          0.6378  6
M.   Kisah Antara Manusia (kumpulan cerpen)    1953  iii     0.5380  5
O.   Pancaran Cinta                            1926  iv      0.5393  5
P.   Puspa Mega                                1927  iv      0.5681  6
Q.   Madah Kelana                              1931  iv      0.6477  6
R.   Sandhyakala Ning Majapahit                1933  iv      0.6035  6
S.   Kertajaya                                 1932  iv      0.4872  5
T.   Nyanyian Sunyi                            1937  v       0.5249  5
U.   Begawat Gita                              1933  v       0.3175  2
V.   Setanggi Timur                            1939  v       0.5058  5
W.   Bebasari: toneel dalam 3 pertundjukan           vi      0.5918  6
X.   Pertjikan Permenungan                           vi      0.4988  5
Y.   Kalau Tak Untung                          1933  vii     0.3611  4
Z.   Pengaruh Keadaan                          1937  vii     0.3655  4
AA.  Ni Rawit Ceti Penjual Orang               1935  viii    0.7906  8
AB.  Sukreni Gadis Bali                        1936  viii    0.7492  7
AC.  I Swasta Setahun di Bedahulu              1938  viii    0.7882  8
AD.  Rindoe Dendam                             1934  ix      0.6034  6
AE.  Kehilangan Mestika                        1935  x       0.5132  5
AF.  Karena Kerendahan Boedi                   1941  xi      0.8084  8
AG.  Pembalasan                                      xi      0.4057  4
AH.  Palawija                                  1944  xii     0.3886  4


{(i) Sutan Takdir Alisjahbana; (ii) Hamka; (iii) Armijn Pane; (iv) Sanusi
Pane; (v) Tengku Amir Hamzah; (vi) Roestam Effendi; (vii) Sariamin Ismail;
(viii) Anak Agung Pandji Tisna; (ix) J. E. Tatengkeng; (x) Fatimah Hasan
Delais; (xi) Said Daeng Muntu; (xii) Karim Halim}.







[Matrix with rows and columns labeled by authors (1-9, i-xii) and works (a-ac); entries are relation types.]

Fig. 1: Matrix of relations.



In a space provided with a distance measure, we extract more information
from the Web using the Yahoo! search engine, and then we build the
associated distance matrix whose entries are the pairwise distances
between the objects, based on Definition 11. We define the types of
relations between an author and his/her works in 9 categories: (1) unclose
(value < 0.11), (2) weakest (0.11 ≤ value < 0.22), (3) weaker (0.22 ≤
value < 0.33), (4) weak (0.33 ≤ value < 0.44), (5) middle (0.44 ≤ value <
0.56), (6) strong (0.56 ≤ value < 0.67), (7) stronger (0.67 ≤ value <
0.78), (8) strongest (0.78 ≤ value < 0.89), and (9) close (value ≥ 0.89).
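The nine categories amount to a lookup over fixed thresholds. The following is a minimal sketch of that lookup; treating each lower bound as inclusive is an assumption, and the names `BOUNDS`, `LABELS` and `relation_type` are hypothetical.

```python
from bisect import bisect_right

# Upper bounds of categories 1..8 as listed in the text; values at or
# above the last bound fall into category 9. Treating each lower bound
# as inclusive is an assumption.
BOUNDS = [0.11, 0.22, 0.33, 0.44, 0.56, 0.67, 0.78, 0.89]
LABELS = ["unclose", "weakest", "weaker", "weak", "middle",
          "strong", "stronger", "strongest", "close"]


def relation_type(value: float) -> tuple:
    """Map a similarity value in [0, 1] to (category number, label)."""
    if not 0.0 <= value <= 1.0:
        raise ValueError("value must lie in [0, 1]")
    k = bisect_right(BOUNDS, value)
    return k + 1, LABELS[k]


print(relation_type(0.7348))
print(relation_type(0.8189))
```

Keeping the thresholds in one sorted list makes it easy to adjust the category boundaries without touching the lookup itself.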

Specifically, some Indonesian intellectuals of Commissie voor de
Volkslectuur and New Writer are well known because of their works, mainly
works by famous authors that are popular in society. Some works are also
visible because their name is familiar (or shared), for example the story
"Begawat Gita" by Tengku Amir Hamzah, or because the title frequently
appears as a word in the works of other people or in web pages, for
example the story "Pertemuan" by Abas Soetan Pamoentjak; see Table 3 and
Table 4.

Generally, strong interactions appear in web pages between Commissie voor
de Volkslectuur and New Writer. This situation derives from the fact that
the works appeared in the same or adjacent ranges of years. In other
words, we know that New Writer is the opposing idea to Commissie voor de
Volkslectuur [20], so in any discussion about Indonesian intellectuals
the two are always contested and discussed together; see Fig. 1.

5 Conclusions and Future Work 

The proposed similarity has the potential to be incorporated into methods
for enumerating and generating relations between objects. It shows how to
uncover the underlying strength of relations by exploiting search engine
hit counts, but this work does not consider the length of queries.
Therefore, near-future work is to experiment further with the proposed
similarity and to look into the possibility of enhancing the performance
of the measurements in some cases.